Learning from Open-Source Solutions
May 28, 2025
10:50
On diffvg optimization: diffvg is a differentiable rasterizer. I need to read the code. Given tunable parameters, a differentiable pipeline, and a loss, how exactly does the hill climbing work? (A minimal sketch follows.)
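A minimal sketch of that loop, assuming the pattern of pydiffvg's bundled painterly_rendering example (target, shapes, and shape_groups are placeholders for a target bitmap and paths from a bitmap-to-SVG conversion): the SVG's points and colors become leaf tensors, the rasterizer is just another autograd op, and a plain Adam loop does the climbing.

import torch
import pydiffvg

render = pydiffvg.RenderFunction.apply
w, h = 256, 256

# make the SVG parameters trainable leaf tensors
point_vars = [path.points.requires_grad_() for path in shapes]
color_vars = [group.fill_color.requires_grad_() for group in shape_groups]
opt = torch.optim.Adam(point_vars + color_vars, lr=1.0)

for step in range(500):
    opt.zero_grad()
    scene_args = pydiffvg.RenderFunction.serialize_scene(w, h, shapes, shape_groups)
    img = render(w, h, 2, 2, step, None, *scene_args)  # 2x2 samples per pixel, no background
    loss = (img[..., :3] - target).abs().mean()        # L1 distance to the target bitmap
    loss.backward()                                    # gradients flow back through the rasterizer
    opt.step()
    for c in color_vars:
        c.data.clamp_(0.0, 1.0)                        # keep colors in a valid range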
Summary:
1. Directly fine-tune SD: LoRA, FlowGRPO, etc.
2. Don't fine-tune SD; use diffvg optimization to tune the SVG's parameters (paths, etc.).
3. Hack the OCR check.
3rd place solution: VQA/AES = 0.81/0.64, diffusion model + differentiable SVG optimization (source code available)
1. Generate images with FLUX.1-schnell, using multiple prompts to increase diversity and quality.
2. First use VTracer to roughly convert the bitmap to SVG.
3. Use SigLIP scores to pick the two best images as candidates; after a quick optimization pass, keep the SVG with the higher aesthetic score for further optimization.
4. Optimize the SVG's parameters (control points of paths, fill colors, stroke colors, and stroke widths) with pydiffvg.
5. The competition host applies defensive preprocessing to the image (random crop, filtering, compression), which is not differentiable. So they wrote custom differentiable operators that mimic the host's preprocessing, letting gradients flow back into the SVG parameters (see the sketch after this list).
6. Some post-processing.
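A sketch of what such operators might look like (my own guesses, not the 3rd place team's actual code): cropping is plain slicing and already differentiable with respect to pixel values; filtering can be approximated with a Gaussian depthwise convolution; quantization-style compression can use a straight-through estimator so the backward pass is the identity.

import torch
import torch.nn.functional as F

def crop(img, top, left, size):
    # slicing is differentiable w.r.t. pixel values as-is
    return img[..., top:top + size, left:left + size]

def gaussian_kernel(ksize=5, sigma=1.0):
    ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, ksize, ksize)

def blur(img, ksize=5, sigma=1.0):
    # depthwise Gaussian conv as a smooth stand-in for the host's filtering
    c = img.shape[1]
    k = gaussian_kernel(ksize, sigma).to(img).expand(c, 1, ksize, ksize)
    return F.conv2d(img, k, padding=ksize // 2, groups=c)

def st_quantize(img, levels=32):
    # straight-through estimator: round in the forward pass, identity gradient backward
    q = torch.round(img * levels) / levels
    return img + (q - img).detach()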
1st Place Solution
Key point: the image used for the OCR check is NOT the same as the image used for VQA and aesthetic evaluation! The OCR check runs on the original image.
Even if the two images were identical, it could still be hacked; I asked this question in the thread.

5th place solution - VTracer and DiffVG optimization (source code available)
My final solution is based on generating a bunch of images with SDXL-turbo, converting them to SVGs, and then adding an optimized patch to significantly improve the aesthetic score. After that, I use a proxy function for the VQA model to find the best SVG to use in the end.
Via diffvg optimization, find the patch that maximizes the gain in aesthetic score and place it at a specific position in the image; otherwise the patch would hurt the VQA score.
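A rough sketch of how that patch search might look (my assumption, reusing the pydiffvg render loop sketched earlier; aesthetic_model and render_svg are placeholders for a differentiable aesthetic scorer and the rasterization call): freeze the converted SVG, append a few extra paths as the patch, and run gradient ascent on the aesthetic score.

import torch

patch_vars = [p.points.requires_grad_() for p in patch_shapes]  # only the patch is trainable
opt = torch.optim.Adam(patch_vars, lr=0.5)
for step in range(200):
    opt.zero_grad()
    img = render_svg(base_shapes + patch_shapes)  # placeholder for the pydiffvg render
    loss = -aesthetic_model(img)                  # gradient ascent on the aesthetic score
    loss.backward()
    opt.step()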

Pipeline optimizations:
For example, whenever a few images are ready from SDXL, we can start converting while we keep generating more images in parallel. The same for scoring when a few SVGs are ready.
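One way to get that overlap (my own sketch, not the team's code) is a producer/consumer pair around a queue; generate_image and bitmap_to_svg stand in for the SDXL and VTracer steps.

import queue
import threading

q = queue.Queue()
svgs = []

def producer(prompts):
    for p in prompts:
        q.put(generate_image(p))  # placeholder for the SDXL-turbo call
    q.put(None)                   # sentinel: generation is finished

def consumer():
    while (img := q.get()) is not None:
        svgs.append(bitmap_to_svg(img))  # placeholder for the VTracer conversion

t1 = threading.Thread(target=producer, args=(prompts,))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()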
After generating all the SVGs, the selection of the best one is done using a proxy function over the VQA score. I simply followed the method described in this paper and used the "Yes" vs "No" probability for this question: f"<image>answer en Question: Does this image contain {description}? Answer Yes or No.\nAnswer:".
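A sketch of such a proxy using the transformers PaliGemma API (the checkpoint name is an assumption, and whether the <image> placeholder is needed depends on the transformers version):

import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumption; substitute the competition checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

def vqa_proxy(image, description):
    prompt = f"<image>answer en Question: Does this image contain {description}? Answer Yes or No.\nAnswer:"
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    yes_id = processor.tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = processor.tokenizer.encode("No", add_special_tokens=False)[0]
    p_yes, p_no = torch.softmax(logits[[yes_id, no_id]], dim=0)
    return p_yes.item()  # renormalized P("Yes") against P("No")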
4th Place Solution: SD3.5M + DRaFT + diffvg
My pipeline first generates images with Stable Diffusion 3.5 Medium, fine-tuned with DRaFT to produce high-quality, score-optimized outputs. The images are then heuristically converted to rough SVGs, which are further refined using diffvg, a differentiable rasterizer.
Fine-tune SD with a DRaFT LoRA, using a dataset generated with GPT-4o. The fine-tuning rewards:
Reward functions: I used two general reward functions that I knew worked well from my previous work on DRaFT:
These models were trained on human preference data. I also added three competition-specific rewards:
They also used diffvg optimization: "The optimization minimizes the L1 distance between the rendered SVG and the original diffusion image using gradient descent on the locations/colors of the paths." In other words, it mainly refines the bitmap-to-SVG conversion step.
12th Place Solution: SD3.5M + FlowGRPO
Directly fine-tunes SD with GRPO. It is based on flow matching, so it does not apply to SDXL.
Unlike the approaches of many other teams, my method focused on directly fine-tuning Stable Diffusion. The goal was to generate images that were not only suitable for SVG conversion but also aligned with the preferences of the VQA evaluator. Given that the competition's evaluation metric was determined by PaliGemma and an aesthetic score, traditional preference-based methods like DPO (Direct Preference Optimization) were less suitable. During my research, I discovered the recently open-sourced Flow-GRPO [1], which perfectly met my needs. In essence, FlowGRPO employs a reinforcement learning strategy (GRPO - Group Relative Policy Optimization) to train Stable Diffusion by directly aiming to maximize the specified evaluation metric.
Reward Function:
For more direct optimization, I employed the competition metric itself as the reward function: the harmonic mean of the VQA score and the aesthetic score.
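As code, that reward is simply (my sketch; eps added only for numerical safety):

def reward(vqa, aes, eps=1e-8):
    # harmonic mean of the two scores, mirroring the competition metric
    return 2.0 * vqa * aes / (vqa + aes + eps)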
8th Place Solution
We made the most of the infrastructure by utilizing both GPUs in parallel, generating images and scoring in a single thread for each.
Right, so how do you use both GPUs at the same time? Open two threads; this needs some study (sketch below).
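A minimal sketch of the two-thread pattern (my own guess; load_pipeline and prompt_chunks are placeholders). PyTorch releases the GIL while CUDA kernels run, so two Python threads can keep both GPUs busy:

import threading

results = []

def worker(device, prompts):
    pipe = load_pipeline().to(device)  # placeholder for your SD/SDXL pipeline loader
    for p in prompts:
        results.append((p, pipe(p).images[0]))  # list.append is thread-safe in CPython

threads = [threading.Thread(target=worker, args=(f"cuda:{i}", prompt_chunks[i]))
           for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()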
13th Place Solution - A Kaggle beginner's attempt
This one is the most similar to mine; in fact it's an improved version of my approach!!
OCR mitigation
They used the letter 'a' instead of a circle, which probably passes the OCR check more reliably.
SVG compression
I didn't use the compact shorthand path commands such as h1 and v24; I didn't know about them at the time. (For example, starting from (0,0), "L 24 0" can be shortened to "h24", and a following "L 24 24" to "v24".)
Installing the nunchaku package
Installed via subprocess:
import subprocess
import kagglehub

# download the pre-built wheel from a Kaggle dataset, then pip-install it offline
nunchaku_whl_path = kagglehub.dataset_download("howarudo/nunchaku-whl")
subprocess.run(["pip", "install", f"{nunchaku_whl_path}/nunchaku-0.2.0-cp310-cp310-linux_x86_64.whl"])
The 3rd place solution implements it like this instead:
import sys
import kagglehub

# download a Kaggle model that already contains the installed package files,
# then make it importable by appending its path
path_nunchaku = kagglehub.model_download("cnumber/nunchaku/pyTorch/default")
sys.path.append(path_nunchaku)
import nunchaku
The Kaggle model holds an already pip-installed package: an EGG-INFO folder, a pydiffvg folder, and an .so file. It looks like a straight copy of the site-packages directory after pip install.
Dummy VQA
I only used yes/no; I didn't use multiple-choice questions like these:
question = [
    f'Does this image primarily represent "{description}"?',
    f'Is "{description}" clearly shown in the image?'
]
choices = [
    ['definitely', 'somewhat', 'barely', 'no'],
    ['yes', 'somewhat', 'no']
]
answer = [
    'definitely',
    'yes'
]
And like me, they also fitted a regression to combine the dummy scores; this proved useful, better than setting the weights by gut feeling.
VQA = w1 * VQA_dummy1 + w2 * VQA_dummy2 + b
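Fitting w1, w2, and b is ordinary least squares over a small labeled set (a sketch; vqa_dummy1, vqa_dummy2, and vqa_true are placeholder arrays of per-image scores):

import numpy as np

# design matrix: the two dummy scores plus a column of ones for the bias b
X = np.column_stack([vqa_dummy1, vqa_dummy2, np.ones(len(vqa_dummy1))])
w1, w2, b = np.linalg.lstsq(X, vqa_true, rcond=None)[0]
vqa_pred = w1 * vqa_dummy1 + w2 * vqa_dummy2 + b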